{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# KenLM"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "\n",
    "This tutorial is available as an IPython notebook at [Malaya/example/kenlm](https://github.com/huseinzol05/Malaya/tree/master/example/kenlm).\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "KenLM is a very fast and accurate n-gram language model toolkit that does not rely on neural networks, https://github.com/kpu/kenlm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n",
      "  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n",
      "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n",
      "  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n"
     ]
    }
   ],
   "source": [
    "import malaya"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dependency\n",
    "\n",
    "Make sure `pypi-kenlm` is installed:\n",
    "\n",
    "```bash\n",
    "pip3 install pypi-kenlm==0.1.20210121\n",
    "```\n",
    "\n",
    "`pypi-kenlm` is a simple Python wrapper for the original KenLM, https://github.com/kpu/kenlm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### List available KenLM models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'bahasa-wiki': {'Size (MB)': 70.5,\n",
       "  'LM order': 3,\n",
       "  'Description': 'MS wikipedia.',\n",
       "  'Command': ['./lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1',\n",
       "   './build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm']},\n",
       " 'bahasa-news': {'Size (MB)': 107,\n",
       "  'LM order': 3,\n",
       "  'Description': 'local news.',\n",
       "  'Command': ['./lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1',\n",
       "   './build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm']},\n",
       " 'bahasa-wiki-news': {'Size (MB)': 165,\n",
       "  'LM order': 3,\n",
       "  'Description': 'MS wikipedia + local news.',\n",
       "  'Command': ['./lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1',\n",
       "   './build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm']},\n",
       " 'bahasa-wiki-news-iium-stt': {'Size (MB)': 416,\n",
       "  'LM order': 3,\n",
       "  'Description': 'MS wikipedia + local news + IIUM + STT',\n",
       "  'Command': ['./lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1',\n",
       "   './build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm']},\n",
       " 'dump-combined': {'Size (MB)': 310,\n",
       "  'LM order': 3,\n",
       "  'Description': 'Academia + News + IIUM + Parliament + Watpadd + Wikipedia + Common Crawl',\n",
       "  'Command': ['./lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1',\n",
       "   './build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm']},\n",
       " 'redape-community': {'Size (MB)': 887.1,\n",
       "  'LM order': 4,\n",
       "  'Description': 'Mirror for https://github.com/redapesolutions/suara-kami-community',\n",
       "  'Command': ['./lmplz --text text.txt --arpa out.arpa -o 4 --prune 0 1 1 1',\n",
       "   './build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm']}}"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "malaya.language_model.available_kenlm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load KenLM model\n",
    "\n",
    "```python\n",
    "def kenlm(model: str = 'dump-combined', **kwargs):\n",
    "    \"\"\"\n",
    "    Load KenLM language model.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    model: str, optional (default='dump-combined')\n",
    "        Check available models at `malaya.language_model.available_kenlm`.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    result: kenlm.Model class\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = malaya.language_model.kenlm()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-11.912322044372559"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.score('saya suke awak')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-6.80517053604126"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.score('saya suka awak')"
   ]
  },
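  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two scores above differ by about five log units, which is enough to prefer `suka` over the misspelling `suke`. A minimal sketch of using that difference to pick the most likely sentence among candidates follows. `pick_most_likely` is a hypothetical helper, not part of Malaya or KenLM, and the dictionary only stands in for a loaded model so the snippet runs on its own. With a real model, pass `model.score` as `score_fn`.\n",
    "\n",
    "```python\n",
    "def pick_most_likely(candidates, score_fn):\n",
    "    # Scores are log probabilities, so the highest (least negative) wins.\n",
    "    return max(candidates, key=score_fn)\n",
    "\n",
    "\n",
    "# Stand-in scorer built from the two scores recorded above.\n",
    "recorded_scores = {\n",
    "    'saya suke awak': -11.912322044372559,\n",
    "    'saya suka awak': -6.80517053604126,\n",
    "}\n",
    "\n",
    "best = pick_most_likely(list(recorded_scores), recorded_scores.get)\n",
    "print(best)\n",
    "```"
   ]
  },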
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-5.256608009338379"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.score('najib razak')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-10.580080032348633"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.score('najib comel')"
   ]
  },
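  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Raw scores are sums over tokens, so longer sentences tend to score lower regardless of how natural they are. One common way to compare sentences of different lengths is to convert the score to a per-token perplexity. The sketch below assumes that `model.score` returns a log10 probability that includes the end-of-sentence token, which is our understanding of the default behaviour of the `kenlm` Python wrapper. `perplexity_from_score` is a hypothetical helper, and the scores are the ones recorded in the cells above.\n",
    "\n",
    "```python\n",
    "def perplexity_from_score(log10_score, sentence):\n",
    "    # Assumption: the score covers every word plus one end-of-sentence token.\n",
    "    n_tokens = len(sentence.split()) + 1\n",
    "    return 10.0 ** (-log10_score / n_tokens)\n",
    "\n",
    "\n",
    "recorded_scores = {\n",
    "    'saya suke awak': -11.912322044372559,\n",
    "    'saya suka awak': -6.80517053604126,\n",
    "    'najib razak': -5.256608009338379,\n",
    "    'najib comel': -10.580080032348633,\n",
    "}\n",
    "\n",
    "# Lower perplexity means the model finds the sentence more natural.\n",
    "for sentence, score in recorded_scores.items():\n",
    "    print(sentence, round(perplexity_from_score(score, sentence), 2))\n",
    "```"
   ]
  },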
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Build custom Language Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Build KenLM from source:\n",
    "\n",
    "```bash\n",
    "wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz\n",
    "mkdir kenlm/build\n",
    "cd kenlm/build\n",
    "cmake ..\n",
    "make -j2\n",
    "```\n",
    "\n",
    "2. Prepare a plain-text file with one sentence per line (`text.txt` below), then train a 3-gram model with `lmplz` and convert the ARPA output to a quantized binary trie with `build_binary`. Run these commands from the directory that contains the `kenlm/` folder. Feel free to use text from https://github.com/mesolitica/malaysian-dataset/tree/master/dumping:\n",
    "\n",
    "```bash\n",
    "kenlm/build/bin/lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1\n",
    "kenlm/build/bin/build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm\n",
    "```\n",
    "\n",
    "3. Once you have `out.trie.klm`, load it with the `kenlm` Python module to get the same scoring interface:\n",
    "\n",
    "```python\n",
    "import kenlm\n",
    "model = kenlm.Model('out.trie.klm')\n",
    "```"
   ]
  },
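  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The text file passed to `lmplz` in step 2 should contain one sentence per line. A minimal sketch of writing such a file from a list of raw strings is shown below. `write_training_text` is a hypothetical helper, and collapsing whitespace and dropping empty lines are assumptions about reasonable cleaning, not the exact preprocessing used for the released models.\n",
    "\n",
    "```python\n",
    "import os\n",
    "import tempfile\n",
    "\n",
    "\n",
    "def write_training_text(lines, path):\n",
    "    # Write one whitespace-normalised sentence per line and skip empty ones.\n",
    "    count = 0\n",
    "    with open(path, 'w', encoding='utf-8') as f:\n",
    "        for line in lines:\n",
    "            cleaned = ' '.join(line.split())\n",
    "            if cleaned:\n",
    "                print(cleaned, file=f)\n",
    "                count += 1\n",
    "    return count\n",
    "\n",
    "\n",
    "raw = ['  saya   suka awak ', '', 'najib razak']\n",
    "path = os.path.join(tempfile.mkdtemp(), 'text.txt')\n",
    "written = write_training_text(raw, path)\n",
    "print(written, 'lines written to', path)\n",
    "```"
   ]
  },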
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
